04. Implementation

Implementation: MC Prediction (State Values)

The pseudocode for first-visit MC prediction of the state values can be found below. (Feel free to implement either the first-visit or the every-visit MC method. In the game of Blackjack, both methods return identical results, because the same state never recurs within a single episode.)
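As a rough sketch of the algorithm (not the notebook's reference solution), the function below estimates V from a batch of already-collected episodes. The episode format (a list of `(state, action, reward)` tuples, where each reward is the one received after taking that action) and the name `mc_prediction_v` are assumptions made for illustration. Passing `first_visit=False` turns it into every-visit MC.

```python
from collections import defaultdict

def mc_prediction_v(episodes, gamma=1.0, first_visit=True):
    """Monte Carlo estimate of V(s) from a list of episodes (illustrative sketch).

    Assumed episode format: a list of (state, action, reward) tuples, where
    reward is the reward received after taking `action` in `state`.
    """
    returns_sum = defaultdict(float)   # total of sampled returns per state
    returns_count = defaultdict(int)   # number of returns counted per state
    for episode in episodes:
        states = [s for s, _, _ in episode]
        rewards = [r for _, _, r in episode]
        G = 0.0
        # Walk backwards so G_t = R_{t+1} + gamma * G_{t+1} builds up in one pass
        for t in range(len(episode) - 1, -1, -1):
            G = rewards[t] + gamma * G
            s = states[t]
            # First-visit: skip this return if s already occurred earlier in the episode
            if first_visit and s in states[:t]:
                continue
            returns_sum[s] += G
            returns_count[s] += 1
    # V(s) is the sample mean of the counted returns
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}
```

For example, on the single hypothetical episode `[("A", 0, 1), ("B", 0, 0), ("A", 0, 2)]` with `gamma=1.0`, first-visit MC gives V(A) = 3 (only the return from the first visit to A counts), while every-visit MC averages the returns 3 and 2 to give V(A) = 2.5; both give V(B) = 2.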

If you are interested in learning more about the difference between first-visit and every-visit MC methods, you are encouraged to read Section 3 of [this paper](http://www-anw.cs.umass.edu/legacy/pubs/1995_96/singh_s_ML96.pdf).
Their results are summarized in Section 3.6. The authors show:

  • Every-visit MC is biased, whereas first-visit MC is unbiased (see Theorems 6 and 7).
  • Initially, every-visit MC has lower mean squared error (MSE), but as more episodes are collected, first-visit MC attains better MSE (see Corollaries 9a and 10a, and Figure 4).
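The bias in the first bullet can be seen on a tiny example of our own (a hypothetical chain, not taken from the paper): a single state S where the agent earns reward 1, then stays in S with probability p or terminates otherwise. With γ = 1 the true value is V(S) = 1/(1 − p). The sketch below computes, exactly up to a truncation of very long episodes, the expected estimate each method produces from a single episode. The function name `expected_estimates` is illustrative.

```python
def expected_estimates(p, max_len=10_000):
    """Expected one-episode estimates of V(S) on a hypothetical single-state chain.

    Assumed chain: reward 1 per visit to S, then stay in S with probability p,
    otherwise terminate. With gamma = 1 the true value is 1 / (1 - p).
    """
    first_visit = 0.0
    every_visit = 0.0
    for k in range(1, max_len + 1):
        prob = (1 - p) * p ** (k - 1)       # P(episode visits S exactly k times)
        # The k visits see returns k, k-1, ..., 1
        first_visit += prob * k             # first visit uses the full return k
        every_visit += prob * (k + 1) / 2   # mean of k, k-1, ..., 1
    return first_visit, every_visit, 1 / (1 - p)

fv, ev, true_v = expected_estimates(0.5)
print(f"true V(S) = {true_v:.3f}, first-visit = {fv:.3f}, every-visit = {ev:.3f}")
# true V(S) = 2.000, first-visit = 2.000, every-visit = 1.500
```

Here first-visit MC matches the true value in expectation, while every-visit MC, which also averages the shorter returns from later visits, underestimates it.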

Both the first-visit and every-visit methods are guaranteed to converge to the true value function as the number of visits to each state approaches infinity. (In other words, as long as the agent gets enough experience with each state, the value function estimate will be close to the true value.) In the case of first-visit MC, convergence follows from the Law of Large Numbers, and the details are covered in Section 5.1 of the textbook.
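To see the convergence claim empirically, the sketch below samples episodes from a hypothetical single-state chain (reward 1 per visit to S, stay in S with probability p = 0.5, otherwise terminate, so V(S) = 2 with γ = 1) and prints both estimates as the number of episodes grows. It uses only Python's `random` module; the function names `sample_episode` and `estimates` are illustrative.

```python
import random

def sample_episode(rng, p):
    """Return the number of visits to S in one episode of the hypothetical chain."""
    k = 1
    while rng.random() < p:   # stay in S with probability p
        k += 1
    return k

def estimates(num_episodes, p=0.5, seed=0):
    rng = random.Random(seed)
    fv_sum = 0.0   # sum of first-visit returns (one per episode)
    ev_sum = 0.0   # sum of returns from every visit
    ev_count = 0   # total number of visits across all episodes
    for _ in range(num_episodes):
        k = sample_episode(rng, p)
        fv_sum += k                 # the first visit sees the full return k
        ev_sum += k * (k + 1) / 2   # returns k, k-1, ..., 1 added together
        ev_count += k
    return fv_sum / num_episodes, ev_sum / ev_count

for n in (10, 1_000, 100_000):
    fv, ev = estimates(n)
    print(f"{n:>7} episodes: first-visit = {fv:.3f}, every-visit = {ev:.3f}")
```

Both estimates should approach 2 as the episode count grows; the every-visit estimate is biased for small samples but still converges.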

Please use the next concept to complete "Part 0: Explore BlackjackEnv" and "Part 1: MC Prediction: State Values" in Monte_Carlo.ipynb. Remember to save your work!

If you'd like to reference the pseudocode while working on the notebook, you are encouraged to open this sheet in a new window.

Feel free to check your solution by looking at the corresponding sections in Monte_Carlo_Solution.ipynb.